ABSTRACT

Much of the data being created on the web contains
interactions between users and items. Stochastic blockmodels,
and other methods for community detection and clustering
of bipartite graphs, can infer latent user communities
and latent item clusters from this interaction data. These
methods, however, typically ignore the items’ contents and
the information they provide about item clusters, despite
the tendency of items in the same latent cluster to share
commonalities in content. We introduce content-augmented
stochastic blockmodels (CASB), which use item content
together with user-item interaction data to enhance the user
communities and item clusters learned. Comparisons to
several state-of-the-art benchmark methods, on datasets arising
from scientists interacting with scientific articles, show that
content-augmented stochastic blockmodels provide highly
accurate clusters with respect to metrics representative of
the underlying community structure.

Categories and Subject Descriptors 

H.2.8 [Database Management]: Database Applications—Data Mining;
I.5.3 [Pattern Recognition]: Clustering—Algorithms

General Terms 

Algorithms 
1. INTRODUCTION

Motivated by an application to the arXiv [2], we consider
the problem of finding hard clusters of scientific articles in
the presence of user-interaction data and document content.
To this end, we develop a generative model of user-item
interaction as well as item content, such that each document
is associated with a single scalar latent variable indicating
its cluster membership. Our model is essentially a stochastic
blockmodel applied to the bipartite user-item interaction
graph, combined with a content distribution on the document
vertices.

The application to arXiv motivating this work is that of
developing a finer-grained categorization of papers than
currently exists. This categorization will be used to offer
users more fine-grained control over the daily and weekly
feeds of newly submitted papers, and within a new information
filtering system that will learn user preferences. The primary
user base will be those who frequently visit the website to
stay up-to-date with the research community, as is common
among physics researchers.

Stochastic blockmodels were introduced to discover latent
community structure in graphs, typically formed by people
or other entities interacting with each other (each person
is a node, and edges indicate interactions), or by people
interacting with text documents, images, videos, or some
other object (each person is a node, each object is a node,
and edges indicate interactions, forming a bipartite graph).
In this second kind of application, interaction information
is typically the only information used, and information from
the documents themselves is ignored.

Traditional bipartite stochastic blockmodels assume that
different communities tend to interact differently with each
text document, with some communities tending to interact
more frequently with a given document type, and other
communities interacting more frequently with other document
types. This differential preference of communities for
documents induces a latent document clustering, with bipartite
stochastic blockmodels attempting to learn this latent
clustering from interaction information alone. Our model adds
an additional assumption: that documents in each cluster
have distinct characteristics, observable in the words that
occur in them. When this assumption is satisfied, we argue
that it can and should be used to improve performance.

While this assumption does not necessarily hold in all
community detection applications, we argue that it holds
in a wide variety of settings. In this paper, we apply this
model to scientists interacting with scientific articles, where
the words that tend to appear in articles preferred by a
community vary considerably from community to community.
Our model could also be applied to communities interacting
with other kinds of items, e.g., videos, but in our empirical
studies we focus on text.

Our model can also be seen as a co-clustering algorithm,
because it provides a clustering of both users and documents.
However, our model is distinguished from all other
co-clustering algorithms of which we are aware, in that it
uses not just the interaction information, but also covariates
observable in the documents. Thus, our model is distinguished
from co-clustering approaches that use only document
contents (e.g., based only on the matrix of word
co-occurrence) by the way it takes advantage of user co-access
to find the mapping of contents to clusters that matches
the communities’ preferences. It is distinguished from
co-clustering approaches that only use user interactions in that
document contents are used to refine and improve the
co-clustering.

An additional advantage of including document covariates
in our model is that new documents with no interaction history
can be assigned to an appropriate document cluster,
addressing the cold-start problem.

There has been growing interest in combining user
interaction and item content in the context of recommender
systems, as well as combining citations and
content in the context of community discovery (specifically
on document networks) [16], [20], [22]. To our knowledge, this
paper is the first to approach the problem of clustering as a
community detection task on the network of user-document
interactions.

In Section 2 we provide a detailed description of the model,
and in Section 3 we describe how variational methods can be
used for inference. In Section 4 we apply the model to two
real-world datasets from the arXiv, and compare the resulting
clusters to several baseline clusterings.

2. CONTENT-AUGMENTED STOCHASTIC BLOCKMODELS

Suppose we are dealing with an application on the web
such that there are D items and U users potentially
interested in the items. Suppose that over time the users have
been shown and provided feedback on a subset of items. We
encode their feedback with the variable Y, defined by

$$Y_{ij} = \begin{cases}
1 & \text{if item } i \text{ was shown to user } j \text{ and relevant,} \\
0 & \text{if item } i \text{ was shown to user } j \text{ and irrelevant,} \\
A & \text{if item } i \text{ was not shown to user } j.
\end{cases}$$

Note that we only consider a binary response (relevant /
irrelevant), and write $Y_{ij} = A$ to denote the case where
item i is not shown to user j.

The model proceeds by assuming there are $k_d$ clusters
such that each item i belongs to cluster $z_i \in \{1, \ldots, k_d\}$ and
there are $k_u$ communities such that each user j belongs to
community $w_j \in \{1, \ldots, k_u\}$. We assume the community
membership of a user and the cluster membership of a paper
completely determine the probability that the user finds the
paper relevant. Explicitly, the model assumes that

$$p(Y_{ij} = 1 \mid z_i = x, w_j = y, Y_{ij} \neq A) = q_{xy} \tag{1}$$

for constants $q_{xy}$ ranging over item clusters x and user
communities y. For simplicity, we encode the $q_{xy}$ as a matrix
$Q = [q_{xy}]$. We assume the observed $Y_{ij}$ are all sampled
independently.

Finally, we endow the latent variables described above
with the following Bayesian priors:

• The cluster $z_i$ of item i follows a uniform distribution
on $k_d$ elements.

• The community $w_j$ of user j follows a uniform
distribution on $k_u$ elements.

• The cluster-community interest probabilities $q_{xy}$ follow
a Beta($\alpha$, $\beta$) distribution for some $\alpha, \beta > 0$.

We further simplify things by assuming that $k_u = k_d$, that
is, the number of user communities and the number of document
clusters are equal, and use the variable K to indicate this value.

The model described up to this point is a stochastic
blockmodel on a bipartite graph, without any notion of item
content.

To augment the model with content, we suppose that each
item can be represented by an F-dimensional feature vector,
such that the nth entry counts how many times the nth trait
occurred, for some set of F traits. Let $d_i$ represent the
feature vector for the ith item.

In order to enforce that items in the same cluster should be
similar in content, we assume that associated with each item
cluster x is a probability vector $p_x \in [0,1]^F$, $\sum_{n=1}^{F} p_{xn} = 1$,
such that if item i is in cluster x then the feature vector $d_i$
is created from $N_i$ samples of a Multinomial($p_x$) distribution
(the process by which $N_i$ is chosen is unimportant). We
assume the observed $d_i$ are all sampled independently, and we
place a Dirichlet($\gamma$) prior on each $p_x$, for some $\gamma \in (\mathbb{R}_{>0})^F$.

This fully describes the content-augmented stochastic
blockmodel (CASB). In the following section we describe how
variational inference and Gibbs sampling can be used to
infer the latent variables. A graphical depiction of the CASB
can be seen in Figure 1, illustrating the dependencies
between all latent variables.
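
As an illustration only, the generative process described in this section can be simulated with a short script. This is a minimal sketch using the Python standard library, not the authors' implementation. The names sample_casb, n_words (a fixed number of feature draws per item) and p_shown (the chance that an item is shown to a user) are assumptions introduced here, since the model leaves both of those mechanisms unspecified. Clusters and communities are indexed from 0, and None stands for the "not shown" value.

```python
import random


def sample_dirichlet(gamma, rng):
    """Draw from Dirichlet(gamma) by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(g, 1.0) for g in gamma]
    total = sum(draws)
    return [x / total for x in draws]


def sample_casb(D, U, K, F, alpha, beta, gamma, n_words, p_shown, seed=0):
    """Sample one synthetic dataset from the generative process (a sketch).

    n_words and p_shown are illustrative assumptions: the model does not
    specify how many feature draws an item has or which pairs are shown.
    """
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in range(D)]  # item clusters, uniform prior
    w = [rng.randrange(K) for _ in range(U)]  # user communities, uniform prior
    # Q[x][y]: relevance probability for item cluster x and user community y.
    Q = [[rng.betavariate(alpha, beta) for _ in range(K)] for _ in range(K)]
    # P[x]: feature distribution of item cluster x.
    P = [sample_dirichlet(gamma, rng) for _ in range(K)]

    d = []
    for i in range(D):
        counts = [0] * F
        for n in rng.choices(range(F), weights=P[z[i]], k=n_words):
            counts[n] += 1
        d.append(counts)

    # Y[i][j] is 1 (relevant), 0 (irrelevant) or None (not shown).
    Y = [[None] * U for _ in range(D)]
    for i in range(D):
        for j in range(U):
            if rng.random() < p_shown:
                Y[i][j] = 1 if rng.random() < Q[z[i]][w[j]] else 0
    return z, w, Q, P, d, Y
```

Sampling synthetic data in this way is one possible check that an inference routine recovers known cluster assignments before it is applied to real data.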

2.1 Related Work 

There is a growing literature at the intersection of clustering
and community detection, as well as on combining community
detection approaches with node attributes (content). In the
context of document networks, the links considered are most
often citations, as opposed to user interactions.

In [16] the authors introduce two models which jointly
describe text and citations. The first combines latent Dirichlet
allocation [5] with mixed-membership stochastic
blockmodels [1]. They find this model leads to relatively intractable
inference. In response, they introduce another model called
Link-PLSA-LDA which associates a multinomial distribution
with each article, from which the article’s citations are
drawn. Both models use the graph structure and article
content to learn latent vector representations for each of the
articles. While powerful, these do not immediately lend
themselves to hard clusters.

The problem of clustering an arbitrary graph with node
attributes is studied in [22]. Rather than taking a
probabilistic approach, the authors augment the underlying graph by
adding a node for each attribute, with links to each of the
original nodes containing the attribute. Vertex closeness is
given by a neighborhood random walk model, and the
clusters are computed via the resulting distance function.

Table 1: Expected Natural Parameters for Complete Conditionals and Relevant Expectations. Entries in the “Relevant
Expectations” column can be used to compute entries in the “Expected Natural Parameters” column. $\Psi$ represents the
digamma function.

In [20] the problem of community detection is approached
by introducing a discriminative model combining link and
content information. Initially they introduce the
Popularity-based Conditional Link model (PCL). PCL assumes there is
a latent variable describing each node’s community
membership, and the probability of a link between two nodes
depends on each node’s popularity and community
membership. Content is introduced into the model by setting
the probability of belonging to a specific community as the
exponential of a linear function of the node’s content vector.

3. INFERENCE 

It is intractable to directly optimize the likelihood of the
latent variables. Instead, we appeal to variational inference
techniques to find approximate estimates of the latent
variables. In variational inference, we associate to each
latent variable a family of variational distributions, each
parameterized by a free variational parameter. These
parameters are then optimized to find the closest member of the
family to the posterior (in terms of KL-divergence). We will
use mean-field variational inference, which assumes that the
complete joint variational distribution factors.

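We assume the following variational parameters and
distributions:

$$z_i \mid \phi_i \sim \mathrm{Multinomial}(\phi_i), \tag{2}$$

$$w_j \mid \psi_j \sim \mathrm{Multinomial}(\psi_j), \tag{3}$$

$$q_{xy} \mid \alpha_{xy}, \beta_{xy} \sim \mathrm{Beta}(\alpha_{xy}, \beta_{xy}), \tag{4}$$

$$p_x \mid \lambda_x \sim \mathrm{Dirichlet}(\lambda_x). \tag{5}$$

Let q represent the distribution defined above. For
notational convenience we will write expressions involving q
with the understanding that the distribution is conditioned
on the variational parameters.

Recall the evidence lower bound (ELBO) $\mathcal{L}(q)$ is defined
as

$$\mathcal{L}(q) = \mathbb{E}_q[\log p(d, Y, z, w, P, Q)] - \mathbb{E}_q[\log q(z, w, P, Q)] \tag{6}$$

and is equal to the negative KL-divergence between q and the
posterior, up to an additive constant. In order to find the optimal
variational parameters we optimize $\mathcal{L}(q)$ using coordinate
ascent. Following [11], [18], the update for each variational
parameter in coordinate ascent equals the variational
expectation of the natural parameter of the complete conditional
corresponding to the relevant latent variable. The complete
conditional distributions for each of the latent variables are:

$$p(z_i \mid z_{-i}, w, P, Q, d, Y) \propto \prod_{n=1}^{F} p_{z_i n}^{d_{in}} \prod_{j : Y_{ij} = 1} q_{z_i w_j} \prod_{j : Y_{ij} = 0} (1 - q_{z_i w_j}) \tag{7}$$

$$p(w_j \mid w_{-j}, z, P, Q, d, Y) \propto \prod_{i : Y_{ij} = 1} q_{z_i w_j} \prod_{i : Y_{ij} = 0} (1 - q_{z_i w_j}) \tag{8}$$

$$q_{xy} \mid Q_{-xy}, w, z, P, d, Y \sim \mathrm{Beta}(\alpha', \beta') \tag{9}$$

$$p_x \mid P_{-x}, w, z, Q, d, Y \sim \mathrm{Dirichlet}(\gamma') \tag{10}$$

The parameters in equations (9) and (10) are given by:

$$\alpha' = \alpha + \sum_{(i,j) : z_i = x, w_j = y} Y_{ij} \tag{11}$$

$$\beta' = \beta + \sum_{(i,j) : z_i = x, w_j = y} (1 - Y_{ij}) \tag{12}$$

$$\gamma' = \gamma + \sum_{i : z_i = x} d_i \tag{13}$$

where the sums in (11) and (12) range only over pairs with $Y_{ij} \neq A$.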
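
The counting behind the parameters of the Beta and Dirichlet complete conditionals can be made concrete with a small sketch. It assumes hard assignments z and w are given as lists, that Y is a nested list whose entries are 1, 0 or None (None standing for the "not shown" value), and that d is a list of count vectors. The function names are hypothetical and chosen only for this illustration.

```python
def beta_posterior(Y, z, w, x, y, alpha, beta):
    """Parameters (alpha', beta') of the complete conditional of q_xy.

    Y[i][j] is 1, 0 or None; pairs that were not shown (None) are skipped.
    """
    a, b = alpha, beta
    for i, row in enumerate(Y):
        if z[i] != x:
            continue
        for j, y_ij in enumerate(row):
            if w[j] == y and y_ij is not None:
                a += y_ij        # relevant interactions
                b += 1 - y_ij    # irrelevant interactions
    return a, b


def dirichlet_posterior(d, z, x, gamma):
    """Parameter vector gamma' of the complete conditional of p_x."""
    out = list(gamma)
    for i, counts in enumerate(d):
        if z[i] == x:
            out = [g + c for g, c in zip(out, counts)]
    return out
```

For example, with two items in cluster 0 and two users in community 0 whose four shown interactions are 1, 0, 1, 1, a Beta(1, 1) prior is updated to parameters (4, 2).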


The natural parameter for each of these distributions is
summarized in Table 1. To implement coordinate ascent,
each update is simply given by the corresponding entry’s
expected value under q.
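
As a sketch of what these expected values look like, the identities below are the standard expectations of log-probabilities under Beta and Dirichlet distributions, followed by the resulting coordinate update for the item-cluster parameter. The symbols are assumptions made for this illustration: Beta parameters $\alpha_{xy}, \beta_{xy}$ for $q_{xy}$, Dirichlet parameter $\lambda_x$ for $p_x$, multinomial parameters $\phi_i$ and $\psi_j$ for $z_i$ and $w_j$, and $\Psi$ for the digamma function.

```latex
\begin{align*}
\mathbb{E}_q[\log q_{xy}] &= \Psi(\alpha_{xy}) - \Psi(\alpha_{xy} + \beta_{xy}), \\
\mathbb{E}_q[\log(1 - q_{xy})] &= \Psi(\beta_{xy}) - \Psi(\alpha_{xy} + \beta_{xy}), \\
\mathbb{E}_q[\log p_{xn}] &= \Psi(\lambda_{xn}) - \Psi\Big(\sum_{m=1}^{F} \lambda_{xm}\Big), \\
\log \phi_{ix} &= \sum_{n=1}^{F} d_{in}\, \mathbb{E}_q[\log p_{xn}]
  + \sum_{j : Y_{ij} = 1} \sum_{y} \psi_{jy}\, \mathbb{E}_q[\log q_{xy}] \\
  &\quad + \sum_{j : Y_{ij} = 0} \sum_{y} \psi_{jy}\, \mathbb{E}_q[\log(1 - q_{xy})] + \text{const}.
\end{align*}
```

The last line is obtained by taking the logarithm of the complete conditional of $z_i$ and replacing each term by its expectation under q; the constant is fixed by requiring $\sum_x \phi_{ix} = 1$. The update for $\psi_j$ has the same form without the content term.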

4. EVALUATION 

In this section we fit the CASB to two datasets taken
directly from the arXiv, and compare the results to several
baseline clusterings.

The first dataset consists of all 7 819 papers uploaded to
the arXiv in the astro-ph.CO (cosmology) category between
2009 and 2010, and all 621 users who visited the astro-ph.CO
“/new” or “/recent” page at least 5 times and read at least
30 articles. We form a positive link ($Y_{ij} = 1$) between a user
and all the papers they read, and a negative link ($Y_{ij} = 0$)
between a user and all the papers that appeared on the
astro-ph.CO “/new” or “/recent” pages on the days they visited
that they did not read. For documents that appeared on
days when the user did not visit we set $Y_{ij} = A$, making for
a total of 1 090 588 non-A links and 65 814 positive links.
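
The link construction described above can be sketched as follows. The input layout (three dictionaries keyed by paper or user) and the name build_feedback are assumptions made for this illustration; the actual arXiv log format is not described here. Pairs absent from the returned dictionary correspond to the "not shown" value.

```python
def build_feedback(listed_on, visits, reads):
    """Build Y[(paper, user)] in {1, 0}; absent keys mean "not shown".

    listed_on: dict paper -> set of days the paper appeared on the listing
    visits:    dict user  -> set of days the user visited the listing
    reads:     dict user  -> set of papers the user read
    """
    Y = {}
    for user, days in visits.items():
        read_by_user = reads.get(user, set())
        for paper, listed in listed_on.items():
            if paper in read_by_user:
                Y[(paper, user)] = 1   # positive link: the user read the paper
            elif listed & days:
                Y[(paper, user)] = 0   # shown on a visit day but not read
    return Y
```

A paper listed only on days when the user did not visit, and which the user never read, produces no entry at all.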

We constructed the second dataset analogously. We
selected all 6 677 papers uploaded to hep-th (theoretical
high-energy physics) between 2009 and 2010, and all 1 449 users
satisfying the criteria above, who additionally read at least
70 articles. We defined Y in the same manner as above,
leading to 4 579 019 non-A links and 318 703 positive links.

We considered two standard methods of representing
documents to illustrate the flexibility of the model. One
representation was simply bag-of-words (BOW), treating the
document as a vector counting each of the words in its
abstract, so $d_{i\ell}$ represents how many times the $\ell$th word
appeared in the abstract of the ith document, giving rise to
document vector $d_i$. We truncated the abstract vocabulary
to remove highly uncommon words, leading to a vocabulary
of size 4 951 for astro-ph.CO documents and 3 282 for hep-th
documents. (We also did evaluations with the full text of
the papers, but were forced to truncate the vocabulary more
heavily to improve speed. We found better performance
using abstracts than using full texts with a truncated
vocabulary.)

For the other representation we preprocessed the data by
fitting latent Dirichlet allocation, and treating documents
as counts over topics. That is, for the ith document with
content vector $d_i$ we let $d_{i\ell}$ represent the number of times
a word from the $\ell$th topic appeared in the abstract. In our
experiments we fit latent Dirichlet allocation with 50 topics.

We chose to focus on the astro-ph.CO and hep-th
categories because physicists who study cosmology and
high-energy physics are some of the heaviest users of the arXiv.
Many of these users visit the website daily to stay
up-to-date with the research community. As such, these categories
have vast user data that is highly representative of the many
subcommunities. These frequent users are also the primary
target of the recommender system currently being built.

To validate the quality of the CASB clusters, we
introduced several benchmarks.

In [16] Nallapati et al. introduce the Link-PLSA-LDA
(L-P-LDA) model. L-P-LDA is a graphical model that
combines latent Dirichlet allocation (LDA) [5] and the
mixed-membership stochastic blockmodel (MMSB) [1]. It learns
latent vector representations of the documents satisfying
both the topic structure learned by LDA and the community
structure learned by MMSB. Of all the benchmarks this one
is most similar to ours, in that the latent variables capture
both community structure and node content. We applied
this model to both arXiv datasets, treating users as
documents with empty content. Once the vector representations
were learned, KMeans was used to arrive at clusters.

In [10] Gopalan et al. introduce CTPF, a generative model
of document and reader preferences. CTPF learns user
preference vectors and document topic vectors in the same latent
space, for the purpose of recommendations. In this model
a rating $r_{ud}$ between a user u and document d is drawn
according to

$$r_{ud} \sim \mathrm{Poisson}\big(\eta_u^\top(\theta_d + \epsilon_d)\big) \tag{14}$$

for user preference vector $\eta_u$, document topic vector $\theta_d$ and
a small offset vector $\epsilon_d$. Although the purpose of this model
is to learn a latent space for effective document recommendation,
it is similar to ours in that it explicitly considers the
interactions between users and items, as well as item contents.
We fit this model to the two arXiv datasets and again used
KMeans to go from vector representations to clusters.

We looked to [20] for a benchmark that explicitly learns
clusters, based on link data and document content. Yang et
al. propose PCL, a model of link formation where the
probability of a link depends on a node’s latent cluster
membership (similar to our setting with the CASB). They extend
this model to PCLDC by discriminatively incorporating
content: if $x_i \in \mathbb{R}^d$ is the ith node’s content, they assume

$$p(z_i = k) = \frac{\exp(w_k^\top x_i)}{\sum_{l} \exp(w_l^\top x_i)} \tag{15}$$

where $w_k \in \mathbb{R}^d$ is a weight vector associated with the kth
cluster. Since this model requires that each node be
associated with a content vector, we trained this model on
the arXiv data by assigning each user’s content to be the
0-vector. Unfortunately this seriously hindered the
performance of the PCLDC model as a benchmark. It would be
possible to modify this model to handle nodes without
content, but we did not do so.

As another benchmark, we trained 50-dimensional article
vectors with the PV-DBOW method described by Dai et al.
[8]. These article vectors are trained to be predictive of the
text within the article using a hierarchical softmax estimation
of the log-linear objective

$$P(w \mid a) = \frac{\exp(v_w \cdot v_a)}{Z_a}, \qquad Z_a = \sum_{w'} \exp(v_{w'} \cdot v_a). \tag{16}$$

Article vectors trained in this way have proven useful for
semantic analysis tasks, as well as for retaining the semantic
similarity of the articles [8]. The article vectors were trained
on the text extracted from the PDFs of the articles, after
lowercasing the text and inserting word boundaries at each
non-alphanumeric character. Any ‘word’ appearing fewer than
30 times was cut from the vocabulary. After training, the
article vectors were clustered according to spherical k-means
as described in Coates et al. [7].

To improve the utility of the article vectors, in the larger
training example, the astro-ph.CO and hep-th articles were
augmented with a set of 98 392 articles chosen to be
representative of all of the categories on the arXiv.

Finally, we also used KMeans on the LDA vectors as a 
benchmark. 

In order to select the proper value of K, we fit the CASB
to each dataset for K = 2, ..., 10, and for each of the learned
clusterings we calculated the ELBO as in [6]. We selected the
largest value of K that contributed at least a 5% increase to
the ELBO. For both datasets this resulted in K = 6.
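
One possible reading of this selection rule is sketched below. It assumes the 5% increase is measured relative to the magnitude of the ELBO obtained at the previous value of K, which is an interpretation rather than something stated in the text. The function name select_k and the numbers in the usage note are illustrative only.

```python
def select_k(elbo_by_k, min_gain=0.05):
    """Return the largest K whose ELBO improves on the previous K by at
    least min_gain, relative to the previous ELBO's magnitude.

    elbo_by_k: dict mapping K -> ELBO of the fitted model (nonzero values).
    The relative-gain definition is an assumption made for this sketch.
    """
    ks = sorted(elbo_by_k)
    chosen = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        gain = (elbo_by_k[cur] - elbo_by_k[prev]) / abs(elbo_by_k[prev])
        if gain >= min_gain:
            chosen = cur
    return chosen
```

With made-up ELBO values {2: -1000, 3: -900, 4: -800, 5: -790, 6: -700, 7: -695}, the last step that gains at least 5% is from K = 5 to K = 6, so the function returns 6.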

In addition, we present a qualitative evaluation of the
model applied to a third dataset. This dataset consists of five
years of conference proceeding data from the annual
INFORMS conference, the largest conference of its kind for
practitioners of operations research. Papers at INFORMS
are presented in sessions, where a session chair will choose
three or four papers relevant to the session’s subject. We
treat the set of authors as the users in our model, such that
each author has a positive link to every paper presented in
the same session as the author’s own paper and a negative
link to every other paper.

4.1 Community Discovery on INFORMS 

The INFORMS dataset is rich for study because the
conference represented has a large number of sessions and
presented papers (more than 1 000 sessions, and just under 4 000
papers). It includes many distinct research subcommunities,
which CASB is designed to discover. It is also a field
that is very applied in nature, and one in which the authors
have expertise, making evaluation of the clusters easier than
for the two physics datasets. We trained the CASB on this
dataset, setting K = 5 arbitrarily.

To evaluate the quality of these clusters, we formed word
cloud visualizations of the frequently occurring words in each
cluster. More specifically, for each word w we formed the
scalars $w_1, \ldots, w_K$ such that $w_i$ is the proportion of papers
containing the word w that belong to the ith cluster. Then,
in the word cloud corresponding to cluster i, the weight for
the word w is given by $w_i$. This weighting scheme ignores
popular stop words since their distribution will be roughly
uniform across all clusters, whereas words that frequently
occur in the ith cluster but do not occur in other clusters will
have high weight. We limit ourselves to words appearing in more
than 50 papers.
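
The weighting scheme just described can be sketched in a few lines. The input layout (one set of words per paper and one cluster index per paper, starting at 0) and the name word_cloud_weights are assumptions made for this illustration.

```python
from collections import Counter


def word_cloud_weights(papers, clusters, K, min_papers=50):
    """Return {word: [w_1, ..., w_K]} for words in more than min_papers papers.

    papers:   list of iterables of words, one per paper
    clusters: list of cluster indices in range(K), one per paper
    The i-th weight is the share of papers containing the word that
    belong to cluster i.
    """
    per_cluster = [Counter() for _ in range(K)]
    totals = Counter()
    for words, c in zip(papers, clusters):
        for word in set(words):          # count each paper once per word
            per_cluster[c][word] += 1
            totals[word] += 1
    return {
        word: [per_cluster[c][word] / n for c in range(K)]
        for word, n in totals.items()
        if n > min_papers
    }
```

A word that appears in the same share of papers in every cluster receives weights proportional to the cluster sizes, which is why stop words do not dominate any single cloud.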

Figure 2 displays two such word clouds. One is clearly
representative of the mathematical optimization research
community, with words such as “inequality”, “separation” and
“relax”. The other is clearly representative of the
transportation logistics community, with words such as “airline”,
“congest” and “real-time”. We have set up a small website
for a more complete view of the clusters, as well as the other
three word clouds.

While purely qualitative, these word clouds show that the
CASB is able to retrieve real-world research communities
with high accuracy.
Cluster Type                      Misplaced Papers
KMeans PV-DBOW                    3 829
PCLDC                             3 100
KMeans PV-DBOW (auxiliary)        1 357
KMeans Poisson Factorization      1 024
KMeans Link-PLSA-LDA                985
KMeans LDA                          936
CASB (LDA docs)                     915
CASB (bag-of-words docs)            884

Table 2: Number of misplaced papers for each set of clusters.
The number of misplaced papers is taken to be the minimum
of $g_1 + c_2$ and $c_1 + g_2$, where $g_i$ and $c_i$ are the number
of galaxy and cosmology papers in cluster i, respectively.

4.2 Capturing Misplaced Papers 

This evaluation focuses on two astrophysics subcategories
on the arXiv: Cosmology and Nongalactic Astrophysics
(astro-ph.CO); and Astrophysics of Galaxies (astro-ph.GA).

In creating these categories, the arXiv administrators’
intention was for all papers about galactic astrophysics to go
to astro-ph.GA. However, in the past, a significant portion
of the astrophysics community had a different interpretation:
astro-ph.GA was for papers discussing our galaxy, the
Milky Way, while papers discussing other galaxies should go
to astro-ph.CO.

In late 2013, arXiv.org’s moderators began enforcing their
interpretation of these two subcategories, recategorizing
papers about galaxies from astro-ph.CO to astro-ph.GA [9].

We hypothesized that the research communities interested
in nongalactic and galactic papers differ, as do the words
in their papers, and so the CASB should be able to
separate older papers from astro-ph.CO into those nongalactic
papers that were correctly submitted to astro-ph.CO, and
those galactic papers that should have been submitted to
astro-ph.GA. Moreover, it should be able to do this in an
unsupervised way, based only on usage and item content,
without being given examples of papers in each class.

To test this hypothesis, we fit the CASB and each of the
benchmarks to our cosmology dataset consisting of papers
submitted to astro-ph.CO over 2009-2010, setting K = 2.
We then compared each of these clusterings to a ground
truth classification of papers (described below) into those
that were properly submitted to astro-ph.CO, and those that
should have been submitted to astro-ph.GA.

To create our ground truth, we used a Naive Bayes
classifier trained on papers appearing in the arXiv in late 2013
and early 2014, which were manually reclassified by the
arXiv moderators. We then ran this Naive Bayes classifier
on the papers in our 2009-2010 dataset. Note that,
although the Naive Bayes classifier is able to automatically
classify papers as to whether they belong in astro-ph.GA
or astro-ph.CO with high accuracy, this classifier required
hand-curated training labels from the arXiv moderators, in
the form of correct classifications of a large number of
papers from 2013 and 2014. In contrast, in this evaluation,
CASB and the benchmark methods do not have access to
this training data, and instead must make a determination
based only on what was available in 2009-2010.

Figure 2: Word clouds demonstrating the research communities learned by CASB within the INFORMS dataset. Each word
cloud corresponds to a distinct community, and shows words whose relative frequency is high in that community’s papers.
This qualitative result shows that CASB is able to distinguish meaningful research communities in the INFORMS dataset.

Figure 3: Distribution of galaxy and cosmology papers amongst clusters of the astro-ph.CO dataset. The red bars represent
cosmology papers and the black bars represent galaxy papers. A method that performs well puts cosmology papers and galaxy
papers in nearly distinct clusters, so that the red bar is much larger than the black bar in one of the clusters, and the black
bar is much larger than the red in the other cluster.

The distribution of cosmology-classified and galaxy-classified
papers is presented in Figure 3. In Table 2 we present the
number of misplaced papers for each clustering scheme. We
see that the CASB applied to the bag-of-words
representations has the fewest misclustered papers, followed closely
by the CASB applied to the LDA representations.
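
The misplaced-paper count defined in the caption of Table 2 can be computed as in the sketch below. The label strings and the function name misplaced_papers are assumptions made for this illustration; the two clusters are indexed 0 and 1.

```python
def misplaced_papers(labels, clusters):
    """Count misplaced papers for a two-cluster solution.

    labels:   ground-truth label per paper, "galaxy" or "cosmology"
    clusters: cluster index per paper, 0 or 1
    Returns min(g_1 + c_2, c_1 + g_2): the number of papers on the wrong
    side under the better of the two cluster-to-label matchings.
    """
    g = [0, 0]  # galaxy papers per cluster
    c = [0, 0]  # cosmology papers per cluster
    for label, k in zip(labels, clusters):
        if label == "galaxy":
            g[k] += 1
        else:
            c[k] += 1
    return min(g[0] + c[1], c[0] + g[1])
```

Taking the minimum over both matchings is needed because cluster indices are arbitrary: an unsupervised method has no way of knowing which cluster is meant to hold the galaxy papers.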

Of note is the fact that the PCLDC clusters have a very
high number of misclustered papers, despite considering link
presence and document content. We hypothesize this is
because their discriminative incorporation of content does not
generalize well to nodes without content. When the node
has zero content, the distribution from (15) will be uniform.
Since the links in our datasets only exist between users and
documents, PCLDC will have a hard time exploiting the
structure of the graph when each user node is uniform across
all clusters.
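
The claim that a zero content vector yields a uniform distribution under (15) can be checked directly. The sketch below evaluates the softmax for arbitrary, made-up weight vectors; the function name cluster_probabilities is hypothetical and this is not the PCLDC implementation.

```python
import math


def cluster_probabilities(weights, x):
    """Softmax over clusters of the linear scores w_k . x, as in (15).

    weights: list of weight vectors, one per cluster
    x:       content vector of a node
    """
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for w in weights]
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For any weights, a zero vector x gives every cluster the score 0, so each of the K clusters receives probability 1/K. With three clusters, for instance, every user node is assigned probability 1/3 for each cluster regardless of the learned weights.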

Interestingly, KMeans applied to the LDA representations
also has very few misclassified documents. This suggests
that there is a lot of signal in the content of the documents
pointing to the ground truth communities. As the table
shows, the CASB is able to exploit the user data to discover
these communities even more accurately.

4.3 Author-based Evaluation 

To further evaluate the quality of our clusters we looked 
to authorship data, as the papers a researcher writes are a 
strong indicator of the communities to which they belong. 

Specifically, for each of our datasets we took the set of
authors who had written two or more papers. For each of
these authors a we formed the distribution $a_1, \ldots, a_K$ where
$a_i$ is the proportion of documents a has authored belonging
to the ith cluster. (This is the same methodology as
determining weights for the word clouds.)

Now, if a clustering is representative of the underlying
community structure, one would expect an author’s
distribution to be highly concentrated on one or maybe two
clusters. This is because scientific researchers are experts in one
or two fields which contain the majority of their work. They
sometimes branch out and write papers in other
communities, but this is an infrequent activity.

To measure this property, we compute the average entropy
$\mathcal{H}$ of an author’s distribution for each clustering. Entropy
is a measure of the disorder of a distribution. On one extreme,
if an author’s publications all reside in one cluster the
resulting entropy will be 0. On the other extreme, if an
author’s distribution is uniform the resulting entropy will be
$\log_2(6) \approx 2.6$ since we fit the model with 6 clusters. The
entropy for an author a is defined as

$$\mathcal{H}_a = -\sum_{i} a_i \log_2(a_i). \tag{17}$$
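
The author entropy in (17) can be computed as in the following sketch, which takes the cluster index of each paper an author has written. The function names are hypothetical; clusters in which the author has no papers contribute nothing to the sum, following the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def author_entropy(paper_clusters):
    """Base-2 entropy of one author's distribution over clusters.

    paper_clusters: list with the cluster index of each paper by the author.
    """
    counts = Counter(paper_clusters)
    n = len(paper_clusters)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def average_author_entropy(authors):
    """Mean entropy over authors; authors maps a name to its cluster list."""
    return sum(author_entropy(p) for p in authors.values()) / len(authors)
```

An author whose papers all fall in one cluster has entropy 0, and an author with one paper in each of six clusters has entropy log2(6), matching the two extremes described above.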

In Figure 4a the average author entropy is plotted for
each clustering scheme. In this plot we see a slight reversal
in the quality of the benchmarks compared to the
partitioning evaluation. The PV-DBOW clusters trained with
auxiliary data have almost identical entropy to CASB with
LDA representations, and PV-DBOW trained solely on the
astro-ph.CO dataset performs better than all other
benchmarks other than the Link-PLSA-LDA clusters. However,
the quality of the CASB clusters remains the same: CASB
clusters with BOW representations do better than all other
clusters, and CASB clusters with LDA representations are tied
for second as previously mentioned.

In Figure 4b we see a similar pattern. The Link-PLSA-LDA
clusters have marginally better performance than the
CASB clusters with BOW representations, which are both
in turn better than all other clusters. The LDA KMeans
clusters come in third with slightly better performance than
the CASB clusters with LDA representations.

It is surprising that the LDA KMeans clusters have better
performance than CASB LDA clusters and KMeans CTPF
clusters, since the LDA clusters do not leverage user data
at all. However, we still see that leveraging user interaction
data is worthwhile, as the Link-PLSA-LDA clusters and the CASB
clusters with BOW representations have better performance than
the LDA clusters.

The fact that CASBs learn clusters with low average
author-cluster distribution entropy again shows that
the clusters we are learning are representative of the
underlying community structure in these subcategories.

4.4 Coreadership Similarities 

To understand the CASB clusters better, we wanted to
examine the extent to which high coreadership between two
documents determines whether they belong to the same
cluster.

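Recall that for documents d and b with readers $\mathcal{R}_d$ and $\mathcal{R}_b$
respectively, the Jaccard similarity between them is given by

$$J(d, b) = \frac{|\mathcal{R}_d \cap \mathcal{R}_b|}{|\mathcal{R}_d \cup \mathcal{R}_b|}.$$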

The Jaccard similarity is a measure of overlap between two 
sets that is agnostic to their size. That is, if one paper 
has been read by every user, a large intersection between 
this paper’s readership set with another paper’s readership 
set would not contain as much signal as a large intersection 
between two papers with small overall readership. 
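
As a small illustration of the size-agnostic behaviour described above, the Jaccard similarity of two readership sets can be computed as follows. The function name jaccard and the convention of returning 0 for two empty sets are choices made for this sketch.

```python
def jaccard(readers_d, readers_b):
    """Jaccard similarity of two sets of readers: |intersection| / |union|."""
    union = readers_d | readers_b
    if not union:
        return 0.0  # convention chosen here for two unread documents
    return len(readers_d & readers_b) / len(union)
```

Two papers read by users {1, 2, 3} and {2, 3, 4} share two of four distinct readers, giving a similarity of 0.5. A paper read by 100 users compared against a paper read by only two of them also has an intersection of size two, but its similarity is just 0.02.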

To evaluate this criterion, we held out 100 users from each
of the arXiv datasets, and reran inference for the CASB and
all of the benchmarks. We then selected 3 000 documents
from each of the arXiv datasets at random. For each of
these selected documents, we constructed another set
consisting of all those documents whose Jaccard similarity with
the original is at least 0.5. Some of the originally selected
documents did not have Jaccard similarity of at least 0.5
with any other documents in the corpus so they were
discarded, leaving us with 2 442 documents from hep-th and
1 750 documents from astro-ph.CO. From each of these
similarity sets we selected one document at random, giving rise
to a tuple containing the original document and another
document, which together possess Jaccard similarity of at least
0.5.

After arriving at these tuples with high coreadership, we
simply calculated the proportion which belonged to the same
cluster. The results are summarized in Figure 5. Of note is
that CASB with LDA representations has the lowest
coreadership similarity proportion for both astro-ph.CO and
hep-th. Not quite as extreme, the CASB with BOW
representations had the second highest coreadership similarity in the
astro-ph.CO dataset, and fifth highest in the hep-th dataset.
These results show that the CASB clusters are not
optimizing for high coreadership within clusters.

This can be explained by the assumptions inherent
in the model. Recall that the variable $q_{xy}$ represents the
probability of a document in cluster x being clicked by a user in
community y. Hence if two documents have high Jaccard
similarity, it is not necessarily indicative that they arise from
the same cluster. Rather, it is possible that the two
documents belong to separate clusters, but there is a community
of users interested in both of these clusters. As we see in
the author-based evaluation section, this assumption does
not prevent the CASB from learning the true underlying
community structure.

Figure 4: Average author-cluster distribution entropies of the various clusterings for the two arXiv datasets. Lower is better.

Figure 5: Coreadership Similarities of the various clusterings for the two arXiv datasets.

5. CONCLUSION 

In this paper we have presented the content-augmented
stochastic blockmodel (CASB), a probabilistic model of
user-item interactions and item content. The cornerstone
assumption of this model is that users and items exist in
communities such that their community memberships determine
the probability of an interaction, and the content of items
in the same cluster is generated from the same
distribution. We fit this model to two real-world datasets taken
from the arXiv. We found that the learned clusters had the
highest accuracy in distinguishing between two real-world
communities contained in the dataset, and they gave rise to
author-cluster distributions with low entropy. Both of these
results indicate that the model’s assumptions are valid, and
the learned clusters are of high quality.